Datasets

  1. CD-HIT reduction at a 0.9 identity threshold
  2. Graph-Part partitioning to create independent and train-test datasets. Parameters: 0.4 identity threshold, 0.15 validation-set ratio, no moving between clusters.
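The two reduction steps above can be scripted, e.g. by building the CLI calls from Python. The cd-hit flags below (-i, -o, -c, -n) are standard; the graph-part subcommand and flag names are assumptions and should be verified against the Graph-Part documentation before use.

```python
import subprocess

def build_commands(fasta_in: str):
    """Build the CLI calls for the two reduction steps (sketch only)."""
    # Step 1: CD-HIT at 0.9 identity (-c 0.9); -n 5 is the word size
    # recommended for identity thresholds between 0.7 and 1.0.
    cdhit = ["cd-hit", "-i", fasta_in, "-o", "reduced.fasta",
             "-c", "0.9", "-n", "5"]
    # Step 2: Graph-Part homology partitioning at 0.4 threshold,
    # 0.15 validation ratio, no moving between clusters.
    # NOTE: subcommand and flag names here are assumptions --
    # check graph-part's own documentation before running.
    graphpart = ["graphpart", "mmseqs2",
                 "--fasta-file", "reduced.fasta",
                 "--threshold", "0.4",
                 "--val-ratio", "0.15",
                 "--no-moving"]
    return cdhit, graphpart

cdhit_cmd, gp_cmd = build_commands("proteins.fasta")
# subprocess.run(cdhit_cmd, check=True)  # uncomment when cd-hit is installed
```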
Dataset     Before filtering   After CD-HIT   After partitioning   After partitioning
                                              (train-test)         (independent)
N_IM                      59             59                   50                    6
N_OM                      59             56                   46                    4
N_TM                     276            222                  192                   30
N_S                      357            340                  287                   53
N_TL_SEC                  49             43                   37                    4
N_TL_TAT                  84             89                   67                    6
P_IM                     187            128                  106                   11
P_TM                    4456           1237                 1073                  156
P_S                     1417            419                  360                   42

Choosing the best architecture

We use the train-test dataset obtained after homology partitioning to perform 5-fold CV repeated 5 times. Based on the CV results, we want to select the optimal architecture (highest mean kappa). We use repeated CV to reduce the variance of the performance estimates and to check architecture stability.
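The evaluation scheme above maps directly onto scikit-learn's RepeatedStratifiedKFold with Cohen's kappa as the scorer. The sketch below uses a toy classification problem and a logistic regression as stand-ins; the real features, labels, and model come from the partitioned protein data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Toy stand-in for the train-test dataset.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)

# 5-fold CV repeated 5 times -> 25 kappa scores per architecture.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
kappa = make_scorer(cohen_kappa_score)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring=kappa, cv=cv)
print(f"mean kappa = {scores.mean():.3f} +/- {scores.std():.3f}")
```

The 25 per-fold scores give both the mean kappa used for ranking and the standard deviation used to judge stability.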

Problem 1: The differences between the models with the highest mean kappa values are very small

Problem 2: Architectures with the highest kappa values tend to have higher standard deviations of performance, especially for the most problematic localizations (N_OM, N_IM)

Questions

How to select the best architecture? Select the one with the highest mean kappa regardless of the standard deviations? Include the standard deviations in the decision (and if so, how)? Use some other method?
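One way to include the standard deviations in the decision is a one-standard-error-style rule: instead of the raw argmax of mean kappa, rank architectures by mean minus a penalty times the standard deviation. A minimal sketch, with made-up CV summaries (the architecture names and numbers are hypothetical):

```python
# Hypothetical per-architecture CV summaries: (mean kappa, std over 25 folds).
results = {
    "arch_A": (0.812, 0.060),
    "arch_B": (0.815, 0.110),  # slightly higher mean, much less stable
    "arch_C": (0.790, 0.055),
}

def select(results, penalty=1.0):
    """Pick the architecture maximizing mean - penalty * std."""
    return max(results, key=lambda k: results[k][0] - penalty * results[k][1])

print(select(results, penalty=0.0))  # raw mean kappa: picks arch_B
print(select(results, penalty=1.0))  # stability-penalized: picks arch_A
```

The penalty weight is a free choice; penalty = 0 recovers the plain "highest mean kappa" rule, and increasing it trades a little mean performance for stability on the hard localizations.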

Independent dataset - is it enough? Test the model on proteins from other organisms not possessing ‘typical’ chloroplasts, such as Paulinella or Plasmodium (data availability is a problem - we will have to check the literature).

Jackknife - everyone else is doing it, but it has known problems (it tends to overestimate model performance and doesn’t work well with GLMs)
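For comparison with the literature, jackknife validation is leave-one-out CV: each protein is predicted by a model trained on all the others. A sketch with scikit-learn on toy data (not the real features):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

X, y = load_iris(return_X_y=True)

# Jackknife: one model fit per sample, each predicting the held-out sample.
pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print(f"jackknife kappa = {cohen_kappa_score(y, pred):.3f}")
```

Note that this yields a single kappa value with no fold-to-fold spread, which is one reason repeated k-fold CV is more informative for the stability question above.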